Before we starting analyzing the drought data, let’s first talk about the notion of working directories.
The working directory is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. You can see your current working directory at the top of the console (yours may be different from mine):
You can also see it through the getwd()
function.
As you get more experienced and start handling more projects, it’s a good idea to organize your projects into directories and, when working on a project, set the working directory to the project’s directory. That way, you will know where to find your project files and you won’t mix files from different projects together.
You can change the working directory using the setwd()
function. However, you won’t have to do this if you follow the instructions in the next section.
Click on File > New Project...
, then “New Directory” and “New Project”. (If a pop-up appears asking you if you want to save your current workspace (a .RData
file), click “Don’t Save”.) You should end up at a window something like the one below. Key in “Drought” for “Directory name”. As for “Create project as subdirectory of”, select a folder that makes sense. I keep all my files related to this class in a “STATS32” folder, so it makes sense to create the “Drought” project folder here.
Click on “Create Project”. If you check the current working directory now, you’ll see that it’s the directory you just created. Try exiting RStudio and opening it up again. You’ll see that you are back where you left off: in the Drought project folder.
To exit a project, click on File > Close Project
. To open a project, click on File > Open Project...
, and select the .Rproj file.
Download the drought dataset and place it in the Drought project folder. Now look at the “Files” pane in the bottom-right hand corner: the data set appeared! (Another way to see the list of files in the current working directory is list.files()
.)
Click on the button and select “R Script”. Save the R script as “Drought analysis” in the Drought project folder. Note that the file as a .R
extension: this is the file extension for R scripts.
Why write R scripts? R scripts facilitate easy storage, running and sharing of code. For example:
How to write R scripts? Just type commands in the window for the file, with each command on its own line.
It can be difficult to write an R script all at once. Instead, we can use the following workflow to make sure that script works as it should:
Run Selected Line(s)
(or using the Cmd-Enter
or Ctrl-Enter
shortcut). This action copies the code to the console and runs it.For all the code below, follow the workflow above, i.e. type it into the window for the R script, then run it in the console.
Download and install the readr
and tidyr
packages using install.packages("readr")
and install.packages("tidyr")
. To load our drought dataset, click on the “Import Dataset” button in the “Environment” pane, then click “From Text (readr)…”
For “File/Url”, click the “Browse” button on the right and locate the drought dataset. Within a short period of time, the “Data Preview”, “Import Options” and “Code Preview” sections are populated:
First, look at the “Code Preview” section. This is the code that R is using in order to produce the dataset seen in the “Data Preview”. It loads the readr
package, then uses the read_csv()
function to read in the data file.
Next, look at the “Data Preview” section. Notice how each column has a type associated with it. How does read_csv()
know what type each column is? From the documentation, read_csv()
looks at the first 1000 rows in the dataset and makes a guess. It’s often correct, but sometimes it’s not. For example, it gets the date columns in our dataset wrong.
To change the type, click on the little down arrow next to the column name and select “Date”. When prompted to “enter the format string”, enter %Y%m%d
for the ReleaseDate
column and %Y-%m-%d
for the ValidStart
and ValidEnd
column. Notice how both the “Data Preview” and “Code Preview” sections change.
Now, let’s go back to the “Code Preview” section. Notice how you can actually amend the code! Instead, of using the long default variable name, assign the output of read_csv(...)
to df
.
Finally, highlight the code (except for the last line) and copy it (either by Ctrl-C
or Cmd-C
, or Right click > Copy
). Click the “Cancel” button, then paste the code into the “Drought analysis.R” file (note: your filepath in read_csv(...)
may be different from mine since your data file might be stored in a different location):
library(readr)
df <- read_csv("CA Drought Data 20150924-20180924.csv",
col_types = cols(ReleaseDate = col_date(format = "%Y%m%d"),
ValidEnd = col_date(format = "%Y-%m-%d"),
ValidStart = col_date(format = "%Y-%m-%d")))
Consider what the code above does when someone else opens it on their computer. R looks for the file CA Drought Data 20150924-20180924.csv
in the present working directory. Hence, anyone using this script must make sure this file is in the present working directory; if not an error will occur when the script is run.
When writing scripts, it is good practice to leave comments in your scripts. Comments are lines within your code that are not executed; rather, they are there to tell the programmer what the code is doing. This is a good practice: it is very easy to forget what a block of code is doing after you haven’t seen it for some time.
In an R script, we start a comment with a #
sign, followed by whatever we want to say. R interprets anything on a line after the # sign as comments and will not execute it as code. We can comment our code above as follows:
# read in drought dataset
library(readr)
df <- read_csv("CA Drought Data 20150924-20180924.csv",
col_types = cols(ReleaseDate = col_date(format = "%Y%m%d"),
ValidEnd = col_date(format = "%Y-%m-%d"),
ValidStart = col_date(format = "%Y-%m-%d")))
Let’s view the dataset by typing in its name:
df
## # A tibble: 9,106 x 13
## ReleaseDate FIPS County State None D0 D1 D2 D3 D4
## <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2018-09-18 06001 Alame… CA 0 100 0 0 0 0
## 2 2018-09-11 06001 Alame… CA 0 100 0 0 0 0
## 3 2018-09-04 06001 Alame… CA 0 100 0 0 0 0
## 4 2018-08-28 06001 Alame… CA 0 100 0 0 0 0
## 5 2018-08-21 06001 Alame… CA 0 100 0 0 0 0
## 6 2018-08-14 06001 Alame… CA 0 100 0 0 0 0
## 7 2018-08-07 06001 Alame… CA 0 100 0 0 0 0
## 8 2018-07-31 06001 Alame… CA 0 100 0 0 0 0
## 9 2018-07-24 06001 Alame… CA 0 100 0 0 0 0
## 10 2018-07-17 06001 Alame… CA 0 100 0 0 0 0
## # ... with 9,096 more rows, and 3 more variables: ValidStart <date>,
## # ValidEnd <date>, StatisticFormatID <int>
Before we start using the dataset for our analysis, we should do some sanity checks to make sure that the data we have is indeed the data we expect. For example: California has 58 counties, so we should have all 58 counties represented. We check this using summarize
and the function n_distinct
.
# Sanity check: should have all 58 counties
library(dplyr)
df %>% summarize(distinct = n_distinct(County))
## # A tibble: 1 x 1
## distinct
## <int>
## 1 58
Let’s do another sanity check: Since we have 3 years of data, we should expect around \(52 \times 3 = 156\) entries for each county. We check this by counting the number of entries, then checking that 156
is the only value we receive among all the counts:
# Sanity check: each county should have ~156 observations
df %>% group_by(County) %>%
summarize(count = n()) %>%
distinct(count)
## # A tibble: 1 x 1
## count
## <int>
## 1 157
It’s not 156, but it’s close enough. It’s also a good sign that all the counties have the same number of observaations.
Some thoughts about the fields in our dataset:
ReleaseDate
seems to be the same as ValidStart
, and ValidEnd
is always 6 days later than ValidStart
. We can keep only the ValidStart
column and treat that as the date which the data in that row is referring to.FIPS
and StatisticFormatID
(we don’t know what they mean anyway), and State
is the same for all values in our dataset, so we can safely drop these columns.None
to D4
always add up to 100%. Since we are not going to plot the None
column, we can drop it.Let’s keep just the rows we want using select
in dplyr
. Type in the code below (notice how the code below changes the name for column ValidStart
at the same time):
# select important rows
df_levels <- df %>% select(County, Date = ValidStart, D0:D4)
dplyr
practicedf_levels %>% filter(D4 == 100)
## # A tibble: 618 x 7
## County Date D0 D1 D2 D3 D4
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alpine County 2016-03-08 0 0 0 0 100
## 2 Alpine County 2016-03-01 0 0 0 0 100
## 3 Alpine County 2016-02-23 0 0 0 0 100
## 4 Alpine County 2016-02-16 0 0 0 0 100
## 5 Alpine County 2016-02-09 0 0 0 0 100
## 6 Alpine County 2016-02-02 0 0 0 0 100
## 7 Alpine County 2016-01-26 0 0 0 0 100
## 8 Alpine County 2016-01-19 0 0 0 0 100
## 9 Alpine County 2016-01-12 0 0 0 0 100
## 10 Alpine County 2016-01-05 0 0 0 0 100
## # ... with 608 more rows
df_levels %>%
filter(D4 == 100) %>%
group_by(County) %>%
summarize(count = n()) %>%
arrange(desc(count))
## # A tibble: 22 x 2
## County count
## <chr> <int>
## 1 Santa Barbara County 68
## 2 Ventura County 68
## 3 Fresno County 30
## 4 Kings County 30
## 5 Madera County 30
## 6 Mariposa County 30
## 7 Merced County 30
## 8 San Joaquin County 30
## 9 Stanislaus County 30
## 10 Tulare County 30
## # ... with 12 more rows
df_levels %>% filter(D4 == 100) %>%
distinct(County)
## # A tibble: 22 x 1
## County
## <chr>
## 1 Alpine County
## 2 Amador County
## 3 Calaveras County
## 4 El Dorado County
## 5 Fresno County
## 6 Kings County
## 7 Madera County
## 8 Mariposa County
## 9 Merced County
## 10 Mono County
## # ... with 12 more rows
df_levels %>% filter(Date == "2018-09-18") %>% arrange(desc(D0))
## # A tibble: 58 x 7
## County Date D0 D1 D2 D3 D4
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alameda County 2018-09-18 100 0 0 0 0
## 2 Butte County 2018-09-18 100 0 0 0 0
## 3 Contra Costa County 2018-09-18 100 0 0 0 0
## 4 Marin County 2018-09-18 100 0 0 0 0
## 5 San Francisco County 2018-09-18 100 0 0 0 0
## 6 San Mateo County 2018-09-18 100 0 0 0 0
## 7 Santa Clara County 2018-09-18 100 0 0 0 0
## 8 Santa Cruz County 2018-09-18 100 0 0 0 0
## 9 Sonoma County 2018-09-18 100 0 0 0 0
## 10 Yuba County 2018-09-18 100 0 0 0 0
## # ... with 48 more rows
The chart we want to make display data for just Santa Clara county, so we should use dplyr
’s filter
to get a new dataset with just the observations from Santa Clara:
# filter for just Santa Clara County
df_county <- df_levels %>% filter(County == "Santa Clara County")
Let’s plot a histogram of D0
:
library(ggplot2)
ggplot(data = df_county) +
geom_histogram(aes(x = D0))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next, let’s make a line plot of D0
against Date
:
ggplot(data = df_county) +
geom_line(mapping = aes(x = Date, y = D0))
To make an area plot instead, replace geom_line
with geom_area
:
ggplot(data = df_county) +
geom_area(mapping = aes(x = Date, y = D0))
How can we plot D0
and D1
against Date
? We could try this:
ggplot(data = df_county) +
geom_area(mapping = aes(x = Date, y = D0)) +
geom_area(mapping = aes(x = Date, y = D1))
There are a number of issues with this approach:
D0
to D4
, we need 3 more lines of code. What if there were 10 drought levels instead?D0
even though the values plotted are for D0
and D1
.The reason we are facing this issue is because the values of drought level are column names. We need to reorganize the dataset such that there is a Drought level
column with values D0
to D4
, and a corresponding Percent of land area
which gives the percent of land area in that level of drought (In Hadley Wickham’s terminology, the original dataset is not “tidy” and we have to make it so.) We can use tidyr
’s gather
function to accomplish this. (See optional material in slides for more details.)
# gather drought levels
library(tidyr)
df_county <- df_county %>% gather(D0:D4, key = "Drought level",
value = "Percent of land area")
Now we can easily plot D0
to D4
on the same chart (because our column names have spaces in them, we need to surround them with backticks (`) so that R interprets the code properly):
# make area plot
ggplot(data = df_county) +
geom_area(mapping = aes(x = Date, y = `Percent of land area`,
fill = `Drought level`))
Let’s use scale_fill_brewer
to make the area colors more theme-appropriate, and add a title:
# make area plot
ggplot(data = df_county) +
geom_area(mapping = aes(x = Date, y = `Percent of land area`,
fill = `Drought level`)) +
labs(title = "Drought levels in Santa Clara County") +
scale_fill_brewer(palette = "YlOrRd") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
(scale_fill_brewer
works in the same way as scale_color_brewer
, except that it refers to the “fill” aesthetic rather than the “color” aesthetic. Both of these functions work for discrete variables; for continuous variables, use scale_fill_distiller
and scale_color_distiller
instead.)
At this point, your script should look something like Drought Analysis v1.R
.
Let’s reorganize the script to make it more readable.
library
commands are usually put right at the top.gather
statement with the select
statement, since those are all operations that give us a final dataset that is plotting-friendly. After rearrangement, the script should look like Drought Analysis v2.R
.To check that the script does what it’s supposed to do, quit RStudio, then reopen RStudio. Load the script and run all the lines of code in it. The same chart should appear.
The plot we just made was for Santa Clara County. What if we were interested in another county, say Monterey County, instead? Do we have to type up all the code again? No: we could just look for all mentions of “Santa Clara County” and change them to “Monterey County”.
While this works, it is not advised: if “Santa Clara County” was used 50 times in our script, we’d have to find all 50 instances and change them. A better way to do this would be to assign the county to a variable right at the top of the script, and have all mentions of the county in the script be replaced by this variable.
To do this, insert the following lines at the top of Drought analysis v2.R
(but don’t run them):
# PARAMETERS (to change if necessary)
county <- "Santa Clara County"
Next, replace the line of code for filtering (lines 29-33 of Drought analysis v2.R
) with the following code (but don’t run it):
# filter for just the county we want
df_county <- df_levels %>%
filter(County == county) %>%
gather(D0:D4, key = "Drought level",
value = "Percent of land area")
The other place that “Santa Clara County” appears is in the title of the plot (line 39 of Drought analysis v2.R
). Change it to the following code (see ?paste
to understand what the code is doing):
labs(title = paste("Drought levels in", county)) +
Now, if we want to get the plot for Monterey County, all we have to do is change the string that is assigned to county
and re-run the whole script.
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_0.8.1 ggplot2_3.0.0 bindrcpp_0.2.2 dplyr_0.7.6
## [5] readr_1.1.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 RColorBrewer_1.1-2 pillar_1.3.0
## [4] compiler_3.5.1 plyr_1.8.4 bindr_0.1.1
## [7] tools_3.5.1 digest_0.6.15 evaluate_0.10.1
## [10] tibble_1.4.2 gtable_0.2.0 pkgconfig_2.0.1
## [13] rlang_0.2.1 cli_1.0.0 yaml_2.1.19
## [16] withr_2.1.2 stringr_1.3.1 knitr_1.20
## [19] hms_0.4.2 rprojroot_1.3-2 grid_3.5.1
## [22] tidyselect_0.2.4 glue_1.2.0 R6_2.2.2
## [25] fansi_0.2.3 rmarkdown_1.10 purrr_0.2.5
## [28] magrittr_1.5 backports_1.1.2 scales_0.5.0
## [31] htmltools_0.3.6 assertthat_0.2.0 colorspace_1.3-2
## [34] labeling_0.3 utf8_1.1.4 stringi_1.2.3
## [37] lazyeval_0.2.1 munsell_0.5.0 crayon_1.3.4